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SUMMARY 

Model  determination  is  a  fundamental  data  analytic  task.  Here  we  consider  the 
problem  of  choosing  amongst  a  finite  (with  loss  of  generality  we  assume  two)  set  of  models. 
After  briefly  reviewing  classical  and  Bayesian  model  choice  strategies  we  present  a  general 
predictive  density  which  includes  all  proposed  Bayesian  approaches  we  are  aware  of.  Using 
Laplace  approximations  we  can  conveniently  assess  and  compare  asymptotic  behavior  of 
these  approaches.  Concern  regarding  the  accuracy  of  these  approximation  for  small  to 
moderate  sample  sizes  encourages  the  use  of  Monte  Carlo  techniques  to  carry  out  exact 
calculations.  A  data  set  fit  with  nested  non  linear  models  enables  comparison  between 
proposals  and  between  exact  and  asymptotic  values. 


Key  words:  Bayes  factor,  Laplace  approximations,  Likelihood  ratio  statistics,  Monte  Carlo 

methods. 
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1.  INTRODUCTION 

Model  determination  is  a  fundamental  data  analytic  task.  It  is  also  a  complex 
matter  influenced  by  intended  use  for  the  model  as  well  as  perspective  regarding  the  class 
of  models  being  entertained.  Thus,  it  is  not  surprising  that  no  widely  accepted  strategy  for 
building  models  has  emerged. 

Here  we  consider  a  much  less  ambitious  problem,  that  of  choosing,  within  a 
Bayesian  modeling  framework,  amongst  a  finite  specified  set  of  models.  Such  selection  is 
based  upon  predictive  distributions  and  we  provide  a  general  predictive  formulation  which 
includes  all  proposed  solutions  we  are  aware  of.  By  employing  Laplace  approximations 
(see,  e.g.,  Lindley  1980,  Tierney  &  Kadane  1986)  we  can  carry  out  asymptotic  calculations 
to  conveniently  assess  and  compare  asymptotic  behavior  of  these  proposals  and  develop 
tie-ins  with  classical  likelihood  ratio  based  strategies.  We  shall  then  describe  how 
simulation  based  techniques  can  be  employed  to  perform  exact  calculations.  In  this  regard, 
we  wish  to  use  the  output  of  recently  discussed  Markov  Chain  Monte  Carlo  methods  for 
Bayesian  computation  which  supply  samples  essentially  from  the  posterior  distribution. 
We  shall  also  present  some  comparison  between  proposals,  and  between  exact  and 
asymptotic  values  through  the  fitting  of  nested  non  linear  models  to  a  data  set  consisting  of 
57  points  given  in  Bates  and  Watts  (1988). 

The  literature  on  Bayesian  model  choice  is  considerable  by  now.  It  begins  with  the 
formal  Bayes  approach  which,  in  the  case  of  two  models,  results  in  the  Bayes  factor. 
Subsequent  work  has  proposed  modified  Bayes  factors.  A  current,  reasonably  thorough 
review  appears  in  Gelfand,  Dey  and  Chang  (1992)  and  its  attendant  discussion.  Our  hope 
in  this  paper  is  to  achieve  some  unification  of  this  substantial  literature. 

We  note  that  all  of  this  work  presumes  that  choice  is  made  by  reducing  each  model 
to  a  single  summary  number  and  then  comparing  these  numbers.  Such  severe  data 
reduction  only  permits  model  comparison  in  aggregate;  observation  or  case-level 
diagnostics  enable  a  clearer  comparison  of  model  performance.  For  work  of  this  sort  in  the 
Bayesian  context  see  Geisser  1988,  Pettit  and  Young  1990,  Gelfand,  Dey  &  Chang  (1992). 

The  outline  of  the  paper  is  thus  the  following.  In  Section  2  we  review  the  behavior 
of  the  likelihood  ratio  statistic  and  well  known  adjustments  to  it.  In  Section  3  we 
summarize  the  Bayesian  view  of  model  choice  through  predictive  distributions  and  the 
Bayes  factor.  This  motivates  a  general  definition  of  predictive  densities  in  Section  4  which 
includes  known  cases  in  the  literature  and  yields  a  variety  of  alternative  Bayes  factors. 
Laplace  method  asymptotics  for  general  predictive  densities  are  discussed  in  Section  5  and 
illustrated  for  various  Bayes  factors  in  Section  6.  Concern  for  the  accuracy  of  the  Laplace 
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approximations  as  well  as  the  limitations  of  these  approximations  leads  us,  in  Section  7,  to 
seek  arbitrarily  accurate  exact  calculations  using  Monte  Carlo  methods.  Approximate  and 
exact  calculations  are  applied  to  a  nested  pair  of  nonlinear  models  in  Section  8.  We 
conclude  with  a  few  summary  remarks  in  Section  9. 

2.  CLASSICAL  APPROACHES 

In  what  follows,  we  shall  assume  a  choice  between  two  parametric  models  denoted 
interchangeably  by  joint  density  f(y|  0.;  M.)  or  likelihood  L(0.;  y,  M-),  i  =  1,  2  where  y 

is  n*l  and  0-  is  p.*l.  Since  the  model  choice  techniques  we  consider  reduce  models  to 

single  summary  numbers  overall  selection  can  be  made  through  such  pairwise  comparision. 
Also,  in  practice  models  are  typically  considered  in  pairs,  i.e.,  model  exploration  is  often 
evolutionary,  modifying  a  current  model  to  see  if  improvement  ensues.  Moreover,  classical 
Neyman— Pearson  theory  for  testing  of  models  requires  pairwise  processing. 

Indeed,  a  few  remarks  on  classical  approaches  for  model  choice  seems  an  appropriate 
starting  point.  Informal  procedures  are  generally  based  upon  predictive  performance  in  the 
form  of  comparison,  in  some  fashion,  of  distances  between  observed  values  and  values 
predicted  under  a  given  model.  Occasionally  certain  optimalities  can  be  ascribed  to  such 
procedures.  Implementation  requires  "fitting”  the  model.  Following  Neyman— Pearson 
theory  suppose  we  create  the  hypotheses  H-:  data  y  arise  from  model  M-,  i  =  1,  2  and 

set,  say  as  the  null  hypothesis.  If  the  M-  are  completely  general  there  is  no  optimal 

test  of  Hj  vs  H2  unless  both  models  are  fully  specified.  The  formulation  of  a  likelihood 

ratio  test  requires  an  unambiguous  specification  of  a  null  and  alternative  hypothesis  such  as 
in  the  nested  models  case  where  M^  is  the  reduced  model  and  M2  is  the  full  model.  The 

likelihood  ratio  test  then  takes  the  form:  reject  Hj  if  <  c  <  1  where 

“  U>2\  7,  Mj) 


We  assume  here  and  in  the  sequel  a  regular  case,  i.e.,  the  p.  remain  fixed  as  n  -«  ® 

2 

whence,  under  mild  conditions,  -  21og  is  approximately  distributed  as  xt  _D  under 
Hj.  With  this  approximation,  inconsistency  of  the  likelihood  ratio  test  arises,  that  is, 
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=  1  im  P(-  21og  A  >  -  21og  c)  =  P(^  >  -  21og  c)  >  0. 

n-.®  n  p2  P1 

In  other  words,  AQ  tends  to  be  too  small;  the  likelihood  ratio  test  gives  too  much 

weight  to  the  full  model.  As  a  result  numerous  authors  (see  below)  have  proposed 
penalizing  the  likelihood  in  the  form  log  L(0.;  y,  M-)  -  k(n,  p^)  where  k(n,  p)  >  0,  and 

increasing  in  n  and  p.  Hence,  the  full  model  will  be  penalized  more  than  the  reduced 
model.  We  replace  log  Afl  by  the  larger  quantity  log  An  +  k(n,  p2)  -  k(n,  p1).  For  the 

above  inconsistency  to  vanish  we  need  k(n,  p2)  —  k(n,  Pj)  -*  ®  as  n  -*  ®.  The  form 

k(n,  p)  =  «p  is  most  common  in  the  literature  (though  it  does  not  eliminate 
inconsistency).  Values  for  «  in  the  interval  1  <  «  <  2.5  appear  in  e.g.  Akaike  (1973)  and 
in  Bhansali  and  Downham  (1977).  Nelder  and  Wedderbum  (1972)  suggest  «  =  -j.  Aitkin 

(1991)  suggests  oe  =  log  2.  Choices  of  k  which  depend  upon  n  and  do  eliminate 
inconsistency  include  k(n,  p)  =  £  log  11  (Schwarz,  1978),  k(n,  p)  =*  p  /?  log  (log  n),  /?  >  2 

(Hannan  and  Quinn,  1979)  and  k(n,  p)  =  n  log(n  +  2p)  (Shibata,  1980). 

3.  THE  BAYESIAN  FORMULATION 

The  Bayesian  model  adds  a  prior  specification  r{ff)  to  the  likelihood  specification. 
Inference  is  based  upon  the  posterior  distribution  ir(0|y)  «  L(0;  y)  •  r{0).  For  Bayesian 
model  choice  the  two  model  components  may  be  varied.  The  case  where  L  is  held  fixed 
and  r  is  varied  to  assess  the  sensitivity  of  the  posterior  to  such  prior  variation  is  referred 
to  as  Bayesian  robustness  (Berger  1984,  1985).  Our  intent  here  is  to  parallel  Section  2, 
hence  to  vary  L.  As  such  we  will  assume  "noninformative"  priors. 

The  formal  Bayesian  model  choice  procedure  goes  as  follows.  Let  w.  be  the  prior 

probability  of  M-,  i  =  1,  2  and  f(y|Mj)  the  predictive  distribution  for  model  M-,  i.e., 

=  lf(y|0j>  Mj). 


If  y^  denotes  the  observed  data  then  we  choose  the  model  yielding  the  larger 
wi  f(y0bslMi)-  K  wi  ~  ^  we  use  the  Bayes  factor  (of  with  respect  to  M2) 


Jeffreys  (1961),  (see  also  Pettit  and  Young,  1990)  suggests  interpretive  ranges  for 
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the  Bayes  factor.  Foundational  arguments  (see,  e.g.,  DeGroot,  1970)  insist  that  the  only 
way  to  compare  models  is  through  the  "probabilities  of  these  models"  and  hence,  with  two 
models,  through  the  ratio.  It  is  noteworthy  that  the  Bayes  factor  employs  no  presumption 
of  nesting  and  does  not  require  "fitting". 

In  fact,  presuming  (2)  cam  be  calculated,  why  would  one  seek  alternatives?  One 
criticism  is  that  if  ir(d)  is  improper  (as  it  usually  will  be  under  noninformative 
specification)  then  f(y)  is  as  well.  Hence,  we  can  not  interpret  the  f(y|M.)  as  the 

"probabilities  of  these  models"  nor  can  we  interpret  the  ratio.  Several  authors  have 
attempted,  primarily  in  the  context  of  normal  data  models,  to  develop  a  multiplier  for  BF 
to  overcome  this  problem  (Smith  and  Spiegelhalter,  1980;  Spiegelhalter  and  Smith,  1982; 
Pericchi  1984).  Pericchi  suggests  the  essence  of  the  problem  is  that,  for  a  given 
experiment,  the  expected  increase  in  information  about  model  parameters  varies  with  the 
specification  of  the  model  and  that  the  multiplier  should  neutralize  this  differential. 

Closely  related  to  this  is  the  fact  that  even  under  proper  priors  with  arbitrarily  large 
sample  sizes  the  Bayes  factor  tends  to  attach  too  little  weight  to  the  correct  model.  An 
illustration  is  the  well  known  Lindley  paradox  dating  at  least  to  Bartlett  (1957).  In  the 
nested  model  case,  under  usual  regularity  conditions,  we  shall  show  in  Section  6  that  BF  -» 
a  as  n  -*  n.  In  other  words,  regardless  of  the  data,  as  n  grows  large,  model  M1  will  be 

chosen.  The  behavior  of  BF  contrasts  strikingly  with  that  of  An  which  provides  too 

much  support  for  Mj;  BF  is  qualitatively  larger  than  This  comparison  is  quantified 

in  an  asymptotic  sense  in  Section  6. 

4.  GENERAL  PREDICTIVE  DENSITIES 

The  use  of  predictive  distributions  in  some  form  has  long  been  recognized  as  the 
correct  Bayesian  approach  to  model  determination.  In  particular,  Box  (1980)  notes  the 
complementary  roles  of  the  posterior  and  predictive  distributions  arguing  that  the  posterior 
is  used  for  "estimation  of  parameters  conditional  on  the  adequacy  of  the  model"  while  the 
predictive  distribution  is  used  for  "criticism  of  the  entertained  model  in  light  of  the  current 
data".  In  examining  two  models,  it  is  clear  that  the  predictive  distributions  will  be 
comparable  while  the  posteriors  will  not. 

Box  and  others  have  encouraged  a  less  formal  view  with  regard  to  Bayesian  model 
choice  resulting  in  alternative  predictionist  criteria  to  the  Bayes  factor.  Using  cross 
validation  ideas  (Stone,  1974;  Gasser,  1975)  the  pseudo  Bayes  factor  (PsBF)  arises 
(Geisser  and  Eddy,  1979).  Aitkin  (1991)  proposed  the  posterior  Bayes  factor  (PoBF)  while 
recent  work  of  Berger  and  Pericchi  (1992)  introduces  the  intrinsic  Bayes  factor  (IBF).  All 
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are  intended  to  address  the  aforementioned  problems  associated  with  the  Bayes  factor. 

The  underlying  suggestion  is  that  we  adopt  a  broader  notion  of  predictive 
distributions  and  densities.  In  fact,  we  shall  say  that  a  predictive  density  arises  by 
averaging  a  density  defined  over  some  portion  of  the  sample  space  (arising  from  the 
likelihood)  with  respect  to  a  distribution  on  the  parameter  space  (arising  from  a 
data— based  updating  of  the  prior).  We  assume  that  the  data,  y,  consists  of  a  set  of 
conditionally  independent  univariate  observations,  yj  given  the  parameter  ft  However, 

minor  modifications  (replacing  marginal  densities  for  the  y.  given  0  with  appropriate 

«# 

conditional  densities)  permit  the  handling  of  more  general  models. 

We  introduce  some  notation.  Let  y.,  j  =  l,...,n  be  a  sequence  of  independent 

J 


observations  which,  under  model  M.  have  density  f(yj|  ft,  M-),  i  =  1,  2.  Let  Jq  denote 
the  set  {1,  2,...,n)  and  let  S  be  an  arbitrary  subset  of  J  .  Define  L(ft;  y-,  M)  = 

II  1  u  1 

n  d. 

II  {f(yJ0-,  M-)}  J  where  d-  =  1  if  jeS,  =  0  if  j  t  S.  Finally,  let  r.(ft),  i  =  1,  2  be 
j=l  J  *  *  J  *  * 


the  prior  density  for  ft  under  model  Mj  with  respect  to  Lebesgue  measure. 


Consider  the  formal  conditional  density 


i(ySil*S2,  Mj)  =  Mj)  *((*,1 


ys  '  Mi)  •  L(*,:  -  Mi)  *i(  W 

X  2 

Hut, ;  js  .  Mj)  Ti(«j)d(. 

2 


(3) 


for  Sp  Sj  arbitrary  subsets  of  JR.  The  form  (3)  defines  a  predictive  density  which 

averages  the  joint  density  of  yc  with  respect  to  the  prior  for  ft  updated  by  yc  .  We 

51  1  *2 


take  yg 


to  be  a  subset  of  y  since  for  model  choice  we  want  a  numerical  value  for  (3). 


Examples  of  (3)  in  the  literature  include: 

(i)  Sj  =  Sj  —  <t>  which  yields  the  standard  predictive  or  marginal  density  of  the 

data.  The  denominator  integral  is  ignored  in  this  case. 

(ii)  Sj  *  {r},  S2  *  Jn  -  {r}  which  yields  the  cross-validation  density  f(yr|y^,  Mj) 

where  y^  =  (y1,  y2--,yr_1»  yr+i»-»yn),  as  in  Stone  (1974)  or  Geisser  (1975). 
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(iii) 

(iv) 

(v) 


f(yr|y(r),  M.)  evaluated  at  the  observed  data  is  called  the  conditional  predictive 

n 

ordinate  (CPO),  dating  to  Geisser  (1980).  The  form  n  f(yr  t  y(r)>  has  been 

proposed  as  a  surrogate  for  f(y)  by  Geisser  and  Eddy  (1979). 

Sx  a  small  subset  of  Jn,  usually  two  or  three  elements,  S2  =  Jn  —  extending 


the  single  point  deletion  in  (ii),  as  in  Pena  and  Tiao  (1992). 

=  Jn,  S2  =  Jn  which  yields  Aitkin’s  (1991)  posterior  predictive  density.  Aitkin 

argues  that,  unlike  f(y)  which  results  from  averaging  the  joint  density  of  y  with 
respect  to  the  prior,  one  should  average  with  respect  to  the  posterior. 

Sj  =  Jn  -  S2,  S2  =  {1,  2,...,[pn]}  where  [■]  denotes  the  greatest  integer  function. 

The  idea  here  is  that  a  proportion  p  of  the  observations  be  set  aside  for  prior 
updating  with  the  remainder  to  be  used  for  model  determination.  Such  an  idea  was 
suggested  by  Atkinson  (1978)  and  by  O’Hagan  (1991). 

(vi)  =  Jn  —  S2>  S2  is  a  minimal  subset  (Berger  and  Pericchi,  1992)  i.e.,  the  least 

number  of  data  points  such  that  7r-(^  jy^  )  is  a  proper  density.  In  the  regular 

2 

case,  the  dimension  of  S2  is  fixed  regardless  of  n.  Then  f(yg  |yg  ,  Mj)  is  proper 

1  2 

with,  as  we  will  see,  with  the  same  asymptotic  behavior  as  f(y;  Mj). 

Notice  that,  as  n  -*  od,  (i)  and  (vi)  are  qualitatively  different  from  the  rest;  for  (ii) 
through  (v)  the  cardinality  of  S2  approaches  a  as  n  -*  od.  That  is,  for  (ii)  through  (v), 


we  are  averaging  against  a  distribution  over  the  parameter  space  which,  with  increasing 
sample  size,  places  its  mass  where  0-  must  be.  Hence,  it  is  not  surprising  that  (i)  and  (vi) 

exhibit  different  asymptotic  behavior  from  the  rest. 

The  predictive  densities  (i) ,  (ii),  (iv)  and  (vi)  have  been  employed  in  the  literature 
to  create  model  selection  criteria.  Clearly,  (i)  produces  the  Bayes  factor,  BF,  given  in  (2). 
From  (ii),  we  obtain  the  pseudo-Bayes  factor,  PsBF  (Geisser  and  Eddy,  1979) 


n  f(yr|y(r),  M,) 

PsBF  =  - .  (4) 

n  £(yrly(r),  m2) 


From  (iv),  we  obtain  the  posterior  Bayes  factor  (Aitkin,  1991),  PoBF, 
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f(y|y»  m.) 

PoBF  = - - . 

f(y|y>  m2) 


(5) 


Finally,  from  (vi)  we  can  develop  several  versions  of  an  intrinsic  Bayes  factor  (IBF) 
(Berger  and  Pericchi,  1992).  If  r  is  the  dimension  of  the  minimal  subset  and  Sj ,  /  =  1, 

2, indexes  the  subsets  of  size  r  of  J^,  then  the  objects  in  (vi)  are  of  the  form 

f(y  c  |  7g  >  M-),  where  S  j  denotes  the  complement  of  Sj  relative  to  JQ.  We  might 
S/  1 

consider  the  ratio  of  the  averages  of  such  forms, 


S  f(ysc|ySj,  M,) 

E  f(ysc|yS;,  M2) 

or  the  average  of  the  ratios, 

Mi> 

Il\ — lrt  l  l 


=  BF 


s  f(ySj|M2) 


-1 


-1 


(7) 


(“) 


f(yse|yS(’  m2) 


=  BF  •  (“)-1E 


t(ySilM2) 


(8) 


Since  (°)  can  be  quite  large,  Berger  and  Pericchi  suggest  instead  averaging  on  the  right 
hand  sides  of  (7)  and  (8)  over  a  random  sample  of  subsets  of  size  r  from  Jn. 

5.  LAPLACE  METHOD  ASYMPTOTICS 


We  now  investigate  the  asymptotic  behavior  of  (3).  Asymptotics  are  over  the 
sample  space  through  the  sampling  distribution  of  estimates  of  0.  as  n  -*  ®.  We  employ 

Laplace  method  approximations  as  in  Tierney  and  Kadane  (1986)  whose  form  depends 
upon  Sj  and  Sj.  Let  C(S)  denote  the  cardinality  of  the  set  S.  As  n  -*  to  we  have  three  cases 

(a)  CfSj)-*®,  C(S2)  fixed 

(b)  C(SX)  fixed,  C(S2)^« 

(c)  C(Sj)  -*  »,  C(S2)  -*  ®. 

Examples  (i)  and  (vi)  of  section  4  fall  into  case  (a),  (ii)  and  (iii)  into  case  (b);  (iv)  and  (v) 
into  case  (c).  For  case  (a)  asymptotics  apply  only  to  the  numerator  in  (3).  For  cases  (b) 
and  (c)  they  apply  to  both  numerator  and  denominator. 

The  basic  Laplace  approximation  is 


|emh^d«=emh(^  (2t)P/2  m-P/^-H-Vs)^  +  0<.m~l)  (9) 

where  0  is  pil  with  h  having  unique  mode  0  and  E(0)  is  a  pip  positive  definite 
matrix  such  that  (11(0))^  =  cPh(0)/ 60^60-^.  For  a  ratio  of  integrals  with  g(ff)  >  0 

K^2i]  V  ok-2)  do 


-H  (0)| 


where  mh*(0)  =  mh(0)  +  log  g(0)  with  h*  having  unique  mode  P  and  H*(0)  is  pip 
positive  definite  matrix  such  that  (H*( ^5)^  =  d^h*(0)/d0^d0^. 

We  now  apply  (9)  and  (10)  to  (3).  In  case  (a)  we  use  (9)  for  the  numerator  of  (3) 
with  m  =  C(Sj)  and 

mh^)  =  log  L( ys  ,  M.)  +  log  L(ft,  ys  ,  M{)  +  log  x.^).  (11) 

X  2 

Let  ft  (Sp  S2)  be  the  mode  of  (11).  Then  for  case  (a)  we  have: 

f(iSl^s2'  Mi)  =  iSl'  Mi)-L(*i(si.sO;  Is2'  Mi)-’ri(¥si-s2)) 

•  (2T)Pi/2(C(S1))~Pi/2|-H;1(«i(S1,S2))  I  4-  (f(ys  iM,))-1.  (12) 

2 

(12)  has  0(n_1)  accuracy.  In  case  (b)  we  take  g(^)  =  L(^;  yg  ,  hi)  with  m  =  C(S2)  and 


mhj (ty  =  log  Ufy  7S  ,  M.)  +  log  x^). 

2 


Let  ^j(S2)  denote  the  mode  of  (13).  Applying  (10)  we  have  for  case  (b)  f(jg  |jg  ,  M.)  « 

1  2 


l(VS1»S2^;  7S2’MP’  ^V8!*^  [|-Hf1(^i(S1>S2))| 


l(^(S2);  7g  >  Mj)  •  x^i(S2))  [\-E^  (^(S2))| 

4 


(14)  has  0(n  )  accuracy.  To  handle  case  (c)  assumptions  regarding  the  rates  at  which  the 
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C(Sj)  -*  <d  as  n  -•  od  are  required.  If,  as  in  examples  (iv)  and  (v),  we  have  lim 

n-*0D 

C(S1)/C(S2)  =  k,  then  the  same  approximation  as  in  case  (b)  arises. 

Suppose  the  usual  regularity  conditions  hold  on  the  likelihoods,  L (0.;  y,  M.),  so 
that  if  ft  is  the  maximum  likelihood  estimator  of  ft  based  upon  a  sample  of  size  n  we 

ljll  1 

Aog 


have  0 \  „  — P-»  ft  for  some  ft  „  and  n  V- 
i,n  i,o’  i,o  ' 


d 0. 


-)  (I(^))jk  where 


1(0)  denotes  Fisher’s  information  matrix.  Suppose  we  assume  that  ft( S^So)  maximizes 

1  L  £ 

"  A 

(11)  with  log  x.(ft.)  deleted  and  that  ^(Sj)  maximizes  (13)  with  log  fij(0p  deleted. 
Then,  provided  C(S2)  =  0(n),  we  also  have  •  =  ft  n  +  Op(n— *)  where  •  can  be  any  of 
^(Sj),  ^(SjjSg),  ftj(S2)  or  ft^SpSg)  whence  all  of  these  modes  differ  by  Op(n_1). 
Moreover,  if  t.  is  continuous  t.(-)  =  r(  ft  n)  +  Op(n_1).  Finally,  -H(ft)  — 2-*  I(ft), 
and,  in  fact,  -H(ft^n)  I(ft>0).  As  for  in  case  0>),  L(-i  7$  ’  is 

asymptotically  negligible  so  -H*(ft)  -P-»  I(ft.)  and  also  -H*(ft  )  -P-*  1(0.  ).  For  case 

*  I  Ijll  ljW 

(c),  if  0(5^  =  0(n)  then  -H*^)  -E-*  21(0;)  and  also  -H*(ft  n) -£-*  2I(ft  Q).  We  can 
replace  0;  n  by  •  in  either  H  or  H*  with  the  same  result  holding.  In  fact,  substituting 
asymptotically  equivalent  estimators  of  ft  in  (14),  still  yields  0(n-1)  accuracy. 


6.  ASYMPTOTICS  FOR  VARIOUS  BAYES  FACTORS 
Applying  (12)  to  and  Mg  and  using  the  asymptotics  at  the  end  of  the 

previous  section,  we  obtain 


BF-L^l,n:7,  MP’rl^°l,n^ 
L(02,n’7,  M2^  r2^2,n^ 


1 

T-T— 


1-H2  (®2,n>Sj 

whence,  with  obvious  definition  of  K(0,  ,  02  n)» 


\  n^P2"Pl)/2 


(15) 


Po— Pi 

log  BF  s  log  An  +  log  n  +  K (0ln,  02n) 


(16) 


Expression  (15)  appears  in  Kass  and  Vaidyanathan  (1992).  Expression  (16) 
precisely  reveals  the  difference  in  asymptotic  behavior  between  An  and  BF  in  the  nested 
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case;  K  =  0(1)  so  the  Bayes  factor  will  be  consistent  but  will  exhibit  Lindley’s  paradox. 
Also,  apart  from  K  (16)  yields  Schwarz’s  (1978)  BIC  adjustment  of  An-  If  we  ignore  K  in 

choosing  a  model,  we  can  be  misled  since  K  need  not  be  negligible  (see  section  8).  Such 
concerns  are  pertinent  to  the  ensuing  approximations  for  the  other  variants  of  the  Bayes 
factor  and  encourage  exact  calculation  as  discussed  in  section  7. 

The  pseudo  Bayes  factor  provides  an  interesting  case  for  asymptotic  approximation. 
It  is  built  from  the  cross  validation  distributions  *(yriy(r)>  M-)  which  lend  themselves  to 

a  variety  of  approximations  based  upon  (14)  and  the  different  but  asymptotically 
equivalent  estimators  of  0.  discussed  above.  Suppose,  for  instance,  in  the  numerator  of 

(14)  we  replace  ^(SpSg)  by  &  n  and  in  the  denominator  we  replace  ^(Sj)  by  0-^, 

the  MLE  of  0.  based  upon  the  data  with  yr  removed.  We  obtain 

[|-H71(*i^)|]  '  (1?) 

Since  the  second  and  third  ratios  on  the  right  hand  side  of  (17)  tend  to  1,  we  also  have 

n  f(yj!V  m;) 

f(y,|y/rv  m.)  =  J — - — ---r-T - -  (is) 

1  w  1  n  'fri  Vi  '  Mj) 

j#r  J  1 

and 

n  Kyjlij^Mj) 

n  f(y  |y/rV  m.)  «  n  -i - -Wr - - .  (i9a) 

r  r  W  >  r  nffyJJ^,  Mj) 

j*r  i  ’ 

Two  obvious  simplifying  approximations  of  (19a)  are 

n  f(yrly(r),  ms)  « n  f(yr|0i>n,  m.)  =  L(0in;  y,  m.)  (i9b) 


(19b) 


n  f(yr|y(r),  Mi)  *  n  ffyrlij^,  Mi) . 


(19c) 


Stone  (1977)  and  Geisser  and  Eddy  (1979)  discuss  these  approximations  calling  (19b)  the 
predictive  likelihood  and  (19c)  the  quasi-predictive  likelihood.  Let  us  compare  all  three 
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approximations.  Stone  looks  at  (19b)  and  (19c),  and  argues  that 

lim  (log  n  f(y  |  ftW  Mj)  -log  n  f(y  |  \  ,  M,))  =  -p,. 
n-*ai  r  ’  r  * 

We  can  readily  compare  (19a)  and  (19c).  After  some  manipulation 

WVMi> 

log  n -]-  T  ,'.  7f  7 - loe  n  f(yr  V  Mi) 

1  1  ' 

=  S  (?  log  \B,  Mj)  -  S  log  f(yr  |  Mj)) 

^  j  j 

=  -■1  £  (i.W  -  0.  )T  -  i  )  (20) 

7  i,n  i,n;  iv  i,n  J  K  i,n  i,ny  v  1 

where  H.  is  the  Hessian  matrix  of  L(0.;  y,  M.)  and  8^*)  lies  between  O-^  and  0 
l  '  v  *  v  l  ,n  i,n  i,n 

By  standard  argumentation  for  quadratic  forms  we  may  show  that,  as  n  -*  m,  (20)  -*  pj/2. 

Finally,  using  the  definition  (4),  the  approximation  (19a),  and  the  above  limiting 
relationships  between  (19a),  (19b)  and  (19c)  we  obtain 

log  PsBF  «  log  An  +  (21) 

Hence,  using  the  approximation  (19a)  we  obtain  the  Nelder  and  Wedderburn  (1972) 
adjustment  of  Aq  as  in  Section  2.  Stone  observes  that  using  the  approximation  (19c)  we 

obtain  the  customary  AIC  adjustment  («  =  1)  of  Aq  (Akaike,  1973). 

Turning  to  the  PoBF  and  recalling  earlier  discussion  about  the  case  (c) 
asymptotics,  suppose  in  (14)  we  replace  ^(SpSj)  and  0t(S2)  with  0-  Q.  Then  we  obtain 

(P2-Pi  )/2 
PoBF  »  An  2  1  1 

and 

log  PoBF  s  log  An  +  -  — ip  -  log  2.  (22) 

Result  (22)  along  with  some  additional  asymptotic  calculations  are  provided  in  Aitkin 
(1991).  Hence,  the  correcton  of  the  likelihood  associated  with  the  PoBF  falls  below  that  of 
Nelder  and  Wedderburn  which,  in  turn,  falls  below  the  customary  AIC  adjustment. 
Neither  the  PsBF  nor  PoBF  will  suffer  the  Lindley  paradox. 
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We  next  consider  the  IBF.  From  expressions  (7)  and  (8),  given  (15)  or  (16)  the 
asymptotic  behavior  of  the  IBF  depends  upon  that  of  the  right  hand  side  summations. 
Though  the  components  of  the  sums  are  not  necessarily  independent,  in  many  cases  a  "law 
of  large  numbers"  argument  will  produce  a  limiting  constant  whence,  asymptotically  the 
IBF  behaves  like  a  multiple  of  the  Bayes  factor. 

7.  EXACT  CALCULATIONS  USING  MONTE  CARLO  METHODS 


The  accuracy  of  the  previous  analytic  approximations  is  unknown  in  practice. 
Additionally,  these  approximations  do  not  produce  functional  forms  since  required  modes 
can  rarely  be  obtained  as  explicit  functions  of  the  data.  Rather,  for  a  given  observed 
sample,  O' s  of  O' s  would  be  calculated  numerically  yielding  a  numerical  value  for  (3). 
Therefore,  we  can  not  study  the  behavior  of  or  features  of  such  predictive  distributions.  A 
sampling— based  approach  is  attractive  in  avoiding  the  above  difficulties.  Such  simulation 
approaches  might  be  noniterative  as  in  standard  Monte  Carlo  (see  e.g.  Geweke,  1989)  or 
iterative  as  for  example  using  the  Gibbs  sampler  or  other  Markov  chain  Monte  Carlo 
techniques  (see  e.g.  Gelfand  and  Smith,  1990;  Tierney,  1991).  We  present  no  detail 
regarding  these  techniques.  Rather  we  just  describe  how  they  enable  arbitrarily  accurate 
estimates  of  (3)  as  a  function  of  yg  or  of  expectations  associated  with  (3). 

Suppose  g(fl-)  is  taken  as  an  importance  sampling  density  for  L(0.;  yg  ;  Mi  W 

2 

If  fly,  j  =  1,-.,B-  is  a  sample  from  g  and  we  define  w.j  =  L(0Jj;  yg^;  M^)  *j(0jj)/g(*jj) 
then  a  Monte  Carlo  integration  for  (3)  is 


(23) 


If  a  Markov  chain  Monte  Carlo  technique  has  been  used,  the  output  is  usually  taken  to  be  a 
sample  0jj,  j  =  from  the  posterior  x-(^|y).  But  then  we  can  take  the  posterior  as 

the  importance  sampling  density  in  (23)  resulting  in  the  approximant 


f(7Sll7s2'Mi> 


£ 

j 


(24) 


where  Sg  «  -  Sg.  The  estimator  (24)  is  routine  to  calculate  and  simulation  consistent. 
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£  _ 1 

However,  its  precision  depends  upon  the  stability  of  the  weights  w-.  =  (L(0f yc  ,  M.)  , 

ij  ij  ^2  i 

that  is,  upon  how  good  an  importance  sampling  density  t  (0.  |y)  is  for  t  (0-  |yQ  ).  In  the 

ii  1  1  a2 

case  of  the  PoBF  it  is  perfect  and  the  resulting  estimate  takes  the  simple  form 


B, 


B 


x-1 


1 


(S  1(0*2 ji  7S,.  M2)/B2)  .IMfy  7s,’  Mi)/Br  <25) 

In  the  context  of  cross  validation  we  would  expect  X|(^|y)  to  be  a  good  importance 
sampling  density  for  each  ^(^ly^p  and  (24)  becomes  a  harmonic  mean 

n-1 


) 


(26) 


from  which  the  PsBF  can  be  straightforwardly  calculated. 

Consider  the  special  case  of  f(y,  Al).  Here,  if  r(ft)  is  a  proper  density, 


1  r(  °i ) 

(f(7,  M,))-1  =  J 


l(«ii  7, 


*i(*|y)  dfl. 


Thus,  our  estimator  becomes 


f(7,  M  )  =  (^£ 


Mfy  7,  M.)x.(«fj) 


r1. 


(27) 


In  (27)  r  plays  the  role  of  an  importance  sampling  density  and  natural  choices  to  "match" 
the  posterior  would  be  multivariate  normal  or  t  densities  with  mean  and  covariance 
computed  from  the  0^’s.  If  it  is  proper  we  could  take  it  as  r  obtaining  an  estimator  of 

f(y;  Mj)  proposed  by  Newton  and  Raftery  (1991). 

Finally,  suppose  f(yg  |yg  ;  Mj)  is  proper  and  we  seek  the  expectation  of  h(yg  ) 

12  1 

with  respect  to  this  density.  Suppose  the  conditional  expectation  a^)  =  E(h(yg  )\0-x, 

Mj)  with  respect  to  L (0.;  yg  ,  M^)  is  available  explicitly.  Then  since  Eh(yg  |yg  ;  M-) 

1  12 

-  /*!(*!>  (*il7S2)  = 
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rMlysJ  r.(0.|y) 

"■  ‘  V*  -  w*  5^775;  '< 

S2 


a  Monte  Carlo  integration  for  this  expectation  using  (24)  takes  the  form 


E(%s  )|ys  ;  Mj)  =  [E  J  Mj)]/  E  l/tyrf  j  c,  Mj)  (30) 

1  2  j  Sg  j  Sg 

If  a1(^i)  is  unavailable  explicitly,  but  y£  ,  j  =  1,...,B.  is  a  sample  from  f(yg  | yg  ; 

l,j  1  2 

then  E  h(y~  Cyq  ;  M-)  =  Bj-1  S  h(yA  ).  To  draw  yA  -  f(yc  |yc  ,  M.)  it  suffices  to 
bl  b2  1  1  j  bl,j  bl,j  S1  b2  1 

draw  yA  -  L( O'. .;  yq  ,  M.)  where  0--,  are  a  sample  from  t.(0.  |yQ  ).  Smith  and 
°l,j  °l  1  11  32 

Gelfand  (1992)  discuss  converting  a  sample  from  one  posterior  to  that  from  another.  In  the 

present  case,  we  must  convert  the  0?j  from  x.(tf.|y)  to  0- j  from  ;r(0.  |jg  ). 


8.  A  NUMERICAL  EXAMPLE 


We  consider  a  data  example  where  the  objective  is  to  choose  between  nested 
nonlinear  models.  We  employ  both  asymptotic  results  (Section  6)  and  exact  calculation 
(Section  7)  to  numerically  compare  and  the  various  Bayes  factors.  The  data  concerns 

the  steady  state  adsorption  of  o— xylene  as  a  function  of  oxygen  concentration,  inlet 
o-xylene  concentration  and  temperature.  The  sample  of  57  points  is  presented  along  with 
the  full  model  in  Bates  and  Watts  (1988,  p.  306-309). 

The  full  model  Mg  is,  in  fact,  yj  =  b^Xj,  0|Mg)  +  Ej  where  x.  is  3*1  and  the 

error  €j  is  assumed  independent  N  (0,  <r 2).  Here  b^(x,  0)  =  b1bg/(b1  +  .22788  b2)  with 

b1(x,  0)  =  exxx  exp(-03/xg)  and  bg(x,  0)  =  0gXg  exp(-04/x3).  A  convenient  reduced 
model,  Mp  sets  0j  =  04  yielding  yj  «  br(xj,  0^)  +  €j  where  br(x,  0|Mj)  =  ^^XjXg 
®rp(— 03/X3)/ ( 0^1  +  2.2788  0gXg).  The  modeling  implicitly  assumes  that  all  0|  >  0. 
Hence,  we  take  a  flat  prior  on  0.  =  log  0.  independent  of  the  prior  r{c  )  —  c  on  0  . 

A  nonlinear  regression  fitting  package  (SAS  PROC  NLIN)  was  used  for  Mj  and  for 
Mg  to  obtain  the  maximum  likelihood  estimates  of  all  the  parameters  and  thus  to  compute 
the  likelihood  ratio  statistic  An  *  (7g/tfj)n^exp{pj—  Pg)/2}  which  turns  out  to  be  .004. 
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The  maximum  likelihood  estimates  and  their  asymptotic  covariance  were  used  to  obtain, 
for  each  model,  a  multivariate  t-distribution  which  served  as  an  importance  sampling 
density  for  a  noniterative  Monte  Carlo  approach.  ''Exact"  calculations  based  on  10,000 
simulations  along  with  asymptotic  approximations  are  given  in  Table  1. 


!°e  A„ 

log  BF 

log  PSBF 

log  PoBF 

"Exact" 

Calculation 

-5.52 

—4.94 

-5.13 

-5.27 

Asymptotic 

Approximation 

— 

-3.50 

-5.02 

-5.17 

Table  1.  Monte  Carlo  estimates  and  asymptotic 
approximations  for  model  selection  criteria 

Table  1  indicates  that  regardless  of  the  criterion,  the  full  model  is  overwhelmingly 
selected.  The  asymptotic  approximation  for  BF  is  poor  suggesting  that  for  such  very 
nonlinear  models  the  sample  size  57  is  not  large  enough;  K  in  (16)  is  not  negligible.  More 
precise  approximation  using  (15)  is  nontrivial  in  the  present  case  and  seems  hard  to  justify 
given  the  relative  ease  with  which  the  simulation  approach  can  be  implemented. 

9.  CONCLUDING  REMARKS 

Our  effort  here  has  focused  on  modeling  situations  where  p.  <  <  n,  so-called 

regular  models.  Though  this  encompasses  a  broad  range  of  classical  problems,  much  of 
contemporary  Bayesian  modeling  considers  hierarchical  or  structured  random  effects 
models.  In  such  cases  p^  or  pj—  p^  can  tend  to  ®  as  n->®.  Among  the  disasters  which 

befall  us  in  such  nonregular  problems  are:  i)  all  of  the  asymptotics  presented  here  break 
down  ii)  parameters  may  not  be  consistently  estimated  (a  matter  which  was  recognized  as 
early  as  Neyman  and  Scott,  1948)  iii)  under  improper  priors  the  posterior  need  not  be 
proper  (a  form  of  Bayesian  nonidentifi ability).  Attention  to  the  prior  specification  can 
remedy  (ii)  and  (iii)  but  not  (i).  Hence,  the  model  choice  criteria  discussed  here  require 
exact  calculation  through  the  approaches  of  Section  7.  In  this  regard  a  valuable 
supplement  to  these  experiment-levd  criteria  is  investigation  of  model  performance  at  the 
observation  or  case  level. 
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SUMMARY 

Model  determination  is  a  fundamental  data  analytic  task.  Here  we  consider  the 
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these  approaches.  Concern  regarding  the  accuracy  of  these  approximation  for  small  to 
moderate  sample  sizes  encourages  the  use  of  Monte  Carlo  techniques  to  carry  out  exact 
calculations.  A  data  set  fit  with  nested  non  linear  models  enables  comparison  between 
proposals  and  between  exact  and  asymptotic  values. 


